Tool Usage

Dataset: cybench_tools.parquet

In this example we visualize tool usage over a series of turns in a Cybench evaluation.

Data Preparation

For analysis we read a raw messages data frame from an eval log (see footnote 1), fill the tool call field with “(none)” if there are no tool calls, then filter down to just the fields we need for visualization:

from inspect_ai.analysis.beta import messages_df, MessageColumns, SampleSummary

# read messages from log
log = "<path-to-log>.eval"
df = messages_df(log, columns=SampleSummary + MessageColumns)

# mark messages with no tool calls
df.loc[df["tool_call_function"].isna(), "tool_call_function"] = "(none)"

# trim columns
tools_df = df[[
    "eval_id",
    "id",
    "order",
    "tool_call_function",
    "limit"
]]

Note that the trimming of columns is particularly important because Inspect Viz embeds datasets directly in the web pages that host them (so we want to minimize their size for page load performance and bandwidth usage).
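Both preparation steps can be checked on a small frame before pointing them at a real log. This is a sketch on hypothetical data: the rows are invented, and the extra `content` column is made up to stand in for the large text columns that a real messages frame carries. It uses `fillna()`, which has the same effect as the `.loc` assignment above, and uses the length of a JSON serialization only as a rough proxy for the embedded payload size, not as the format Inspect Viz actually embeds.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by messages_df(); the values
# and the extra "content" column are invented for illustration only.
df = pd.DataFrame({
    "eval_id": ["e1"] * 4,
    "id": ["s1", "s1", "s2", "s2"],
    "order": [1, 2, 1, 2],
    "tool_call_function": ["bash", None, "python", "submit"],
    "limit": [None, None, "token", "token"],
    "content": ["x" * 500] * 4,  # stands in for bulky message text
})

# Same effect as the .loc assignment: label rows that made no tool call.
df["tool_call_function"] = df["tool_call_function"].fillna("(none)")

# Keep only the columns the plots need.
tools_df = df[["eval_id", "id", "order", "tool_call_function", "limit"]]

# Rough proxy for payload size: length of the JSON serialization.
full_bytes = len(df.to_json(orient="records"))
trimmed_bytes = len(tools_df.to_json(orient="records"))
print(full_bytes, trimmed_bytes)
```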

Trajectory Analysis

Here we use a cell() mark to visualize tool use over messages in each sample of an evaluation. We note any limit that ended the sample using a text() mark on the right side of the frame.

from inspect_viz import Data
from inspect_viz.plot import plot, legend
from inspect_viz.mark import cell, text

tools = Data.from_dataframe(tools_df)

tools_domain = ["bash", "python", "submit", "(none)"]

plot(
    cell(                                 # (1)
        tools,
        x="order",
        y="id",
        fill="tool_call_function",
    ),

    text(                                 # (2)
        tools,
        text="limit",
        y="id",
        frame_anchor="right",
        font_size=8,
        font_weight=200,
        dx=50
    ),
    legend=legend("color", location="right"),
    color_domain=tools_domain,            # (3)
    margin_top=0,                         # (4)
    margin_left=200,
    margin_right=100,
    x_ticks=list(range(0, 400, 50)),      # (5)
    x_tick_size=4,
    x_label=None,                         # (6)
    y_label=None
)
1. cell() mark showing tool calls.
2. text() mark showing whether the sample terminated due to a limit.
3. Fix the color domain to our preset tool ordering.
4. Tweak the margins so the axis labels and text annotations appear correctly.
5. Reduce the number of tick marks on the x-axis.
6. No axis labels, as the axes are obvious from the tick marks and legend.
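Each row of the cell() plot is one sample (`id`) and each column one message position (`order`), so the plot is essentially a pivot of `tools_df`. The sketch below shows that layout in tabular form on invented rows. It assumes, as the column selection suggests, that `limit` is a sample-level value repeated on every message row of a sample, so the single annotation per sample can be recovered by taking the first value in each group.

```python
import pandas as pd

# Hypothetical rows in the shape of tools_df (values invented for illustration).
tools_df = pd.DataFrame({
    "eval_id": ["e1"] * 5,
    "id": ["s1", "s1", "s1", "s2", "s2"],
    "order": [1, 2, 3, 1, 2],
    "tool_call_function": ["(none)", "bash", "submit", "(none)", "python"],
    "limit": [None, None, None, "token", "token"],
})

# One row per sample (y), one column per message order (x), tool name as the
# cell value; shorter trajectories leave missing cells at the end.
grid = tools_df.pivot(index="id", columns="order", values="tool_call_function")
print(grid)

# Assumed sample-level column: first non-null limit per sample (NaN if none).
limits = tools_df.groupby("id")["limit"].first()
print(limits)
```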

Sample Drill Down

Here we stack two plots on top of each other: the original sample-level tool calling plot and a bar plot counting the messages that called each tool. Click any sample in the top plot to update the bottom plot so that it counts only the tool calls for that sample.

from inspect_viz import Data, Selection
from inspect_viz.plot import plot, legend
from inspect_viz.mark import cell, text, bar_x
from inspect_viz.interactor import toggle_y
from inspect_viz.layout import vconcat
from inspect_viz.transform import count

tools = Data.from_dataframe(tools_df)

# same fixed tool ordering as the trajectory plot
tools_domain = ["bash", "python", "submit", "(none)"]

click = Selection.single()                # (1)

vconcat(
    plot(
        cell(
            tools,
            filter_by=click,              # (2)
            x="order",
            y="id",
            fill="tool_call_function",
        ),
        toggle_y(target=click),           # (3)
        text(
            tools,
            filter_by=click,
            text="limit",
            y="id",
            frame_anchor="right",
            font_size=8,
            font_weight=200,
            dx=50
        ),
        legend=legend("color", location="right"),
        color_domain=tools_domain,
        margin_top=0,
        margin_bottom=0,
        margin_left=200,
        margin_right=100,
        x_ticks=list(range(0, 400, 50)),
        x_tick_size=4,
        y_domain="fixed",                 # (4)
        x_domain="fixed",
        x_label=None,
        y_label=None
    ),
    plot(
        bar_x(
            tools,
            filter_by=click,              # (5)
            x=count(),
            y="tool_call_function",
            fill="tool_call_function",
        ),
        y_label=None,
        y_domain="fixed",
        color_domain=tools_domain[:-1],
        margin_left=50,
        height=170
    )
)
1. Selection used to filter the plot display. A “single” selection keeps only one active clause at a time, so each click replaces the previous one; marks filtered by it show only the rows matching that clause.
2. Marks are filtered by the “click” selection.
3. The toggle_y() interactor updates the “click” selection with the y-value (the sample id) that has been clicked.
4. Fix the x and y domains so that click selections don’t cause the axes to change.
5. The bar plot is also filtered by the “click” selection.
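Clicking a sample restricts the bar_x() mark to that sample's rows, and count() then tallies rows per tool_call_function. The sketch below is a rough pandas equivalent on invented rows, with a hypothetical clicked sample id of `"s2"`; it illustrates what the bars show, not how Inspect Viz itself computes them.

```python
import pandas as pd

# Hypothetical rows in the shape of tools_df (values invented for illustration).
tools_df = pd.DataFrame({
    "id": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "order": [1, 2, 3, 1, 2, 3],
    "tool_call_function": ["(none)", "bash", "submit", "(none)", "bash", "bash"],
})

clicked = "s2"  # hypothetical sample id picked via toggle_y()

# Filter to the selected sample (the role of filter_by=click), then count
# messages per tool (the role of x=count() grouped by the y channel).
counts = (
    tools_df[tools_df["id"] == clicked]
    .groupby("tool_call_function")
    .size()
)
print(counts)

# With nothing selected, no filter applies and counts span all samples.
all_counts = tools_df.groupby("tool_call_function").size()
print(all_counts)
```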

Footnotes

  1. The eval log read for this example is in the inspect-viz-example-logs repo.